Abstract

The regularization path of the Lasso can be 
shown to be piecewise linear, making it pos- 
sible to "follow" and explicitly compute the 
entire path. In this paper, we analyze this popular
strategy and prove that its worst-case complexity is
exponential in the number of variables. We then contrast
this pessimistic result with an (optimistic) approximate
analysis: we show that an approximate path with at most
$O(1/\sqrt{\epsilon})$ linear segments can always be
obtained, where every point on the path is guaranteed to
be optimal up to a relative $\epsilon$-duality gap. We
complete our theoretical analysis with a practical
algorithm to compute these approximate paths.

1. Introduction 

Without a priori knowledge about data, it is often dif- 
ficult to estimate a model or make predictions, either 
because the number of observations is too small, or the 
problem dimension too high. When a problem solu- 
tion is known to be sparse, sparsity-inducing penalties 
have proven to be useful to improve both the quality 
of the prediction and its interpretability. In particular,
the $\ell_1$-norm has been used for that purpose in the
Lasso formulation (Tibshirani, 1996). 

Controlling the regularization often requires tuning a
parameter. In a few cases, the regularization path (that
is, the set of solutions for all values of the
regularization parameter) can be shown to be piecewise
linear (Rosset & Zhu, 2007). This property is exploited
in homotopy methods, which consist of following the
piecewise linear path by computing the direction of the
current linear segment and the points where the direction
changes (also known as kinks). Piecewise linearity
of regularization paths was discovered by Markowitz 
(1952) for portfolio selection; it was similarly exploited 
by Osborne et al. (2000) and Efron et al. (2004) for
the Lasso, and by Hastie et al. (2004) for the support
vector machine (SVM). As observed by Gärtner et al.
(2010), all of these examples are in fact particular
instances of parametric quadratic programming
formulations, for which path-following algorithms appear
early in the optimization literature (Ritter, 1962).

In this paper, we study the number of linear segments 
of the Lasso regularization path. Even though expe- 
rience with data suggests that this number is linear 
in the problem size (Rosset & Zhu, 2007), it is known 
that discrepancies can be observed between worst-case 
and empirical complexities. This is notably the case 
for the simplex algorithm (Dantzig, 1951), which per- 
forms empirically well for solving linear programs even 
though it suffers from exponential worst-case complexity
(Klee & Minty, 1972). Similarly, by using geometrical
tools originally developed to analyze the simplex
algorithm, Gärtner et al. (2010) have shown that the
complexity of the SVM regularization path can be
exponential. However, to the best of our knowledge,
none of these results applies to the Lasso regularization
path, whose theoretical complexity remains unknown.
The goal of our paper is to fill this gap.

Our first contribution is to show that, in the worst case,
the number of linear segments of the Lasso regularization
path is exactly $(3^p + 1)/2$, where $p$ is the number
of variables (predictors). We remark that our proof is
constructive and significantly different from the ones
proposed by Klee & Minty (1972) for the simplex
algorithm and by Gärtner et al. (2010) for SVMs. Our
approach does not rely on geometry but on an adversarial
scheme. Given a Lasso problem with $p$ variables, we
show how to build a new problem with $p+1$ variables
that increases the complexity of the path by a
multiplicative factor. This results in explicit
pathological examples that are surprisingly simple, unlike
pathological examples for the simplex algorithm or SVMs.

Worst-case complexity analyses are by nature pessimistic.
Our second contribution on approximate regularization
paths is more optimistic. In fact, we show
that an approximate path for the Lasso with at most
$O(1/\sqrt{\epsilon})$ segments can always be obtained,
where every point on the path is guaranteed to be optimal
up to a relative $\epsilon$-duality gap. We follow in part
the methodology of Giesen et al. (2010) and Jaggi (2011),
who have presented weaker results but in a more general
setting for parameterized convex optimization problems.
Our analysis builds upon approximate optimality
conditions, which we maintain along the path, leading
to a practical approximate homotopy algorithm.

The paper is organized as follows: Section 2 presents
a brief overview of the Lasso. Section 3 is devoted
to our worst-case complexity analysis, and Section 4 
to our results on approximate regularization paths. 

2. Background on the Lasso 

In this section, we present the Lasso formulation
of Tibshirani (1996) and well-known facts, which we
exploit later in our analysis. For self-containedness
and clarity, we include simple proofs of these
results. Let $y$ be a vector in $\mathbb{R}^n$ and
$X = [x^1, \ldots, x^p]$ be a matrix in
$\mathbb{R}^{n \times p}$. The Lasso is formulated as:

$$\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1, \qquad (1)$$

where the $\ell_1$-norm induces sparsity in the solution $w$
and $\lambda > 0$ controls the amount of regularization. Under
a few assumptions, which are detailed in the sequel,
the solution of this problem is unique. We denote it
by $w^\star(\lambda)$ and define the regularization path
$\mathcal{P}$ as the set of all solutions for all positive
values of $\lambda$:^1

$$\mathcal{P} \triangleq \{w^\star(\lambda) : \lambda > 0\}.$$

The following lemma presents classical optimality
and uniqueness conditions for the Lasso solution (see
Fuchs, 2005), which are useful to characterize $\mathcal{P}$:

Lemma 1 (Optimality Conditions of the Lasso).

A vector $w^\star$ in $\mathbb{R}^p$ is a solution of Eq. (1) if
and only if for all $j$ in $\{1, \ldots, p\}$,

$$x^{j\top}(y - Xw^\star) = \lambda\,\mathrm{sign}(w_j^\star) \quad \text{if } w_j^\star \neq 0,$$
$$|x^{j\top}(y - Xw^\star)| \leq \lambda \quad \text{otherwise.} \qquad (2)$$

Define $J \triangleq \{j \in \{1, \ldots, p\} : |x^{j\top}(y - Xw^\star)| = \lambda\}$.
Assuming the matrix $X_J \triangleq [x^j]_{j \in J}$ to be full rank, the
solution is unique and we have

$$w_J^\star = (X_J^\top X_J)^{-1}(X_J^\top y - \lambda\eta_J), \qquad (3)$$

where $\eta \triangleq \mathrm{sign}(X^\top(y - Xw^\star))$ is in
$\{-1; 0; +1\}^p$, and the notation $u_J$ for a vector $u$ denotes
the vector of size $|J|$ recording the entries of $u$ indexed by $J$.

^1 For technical reasons, we enforce $\lambda > 0$ even though
the limit $w^\star(0^+) \triangleq \lim_{\lambda \to 0^+} w^\star(\lambda)$ may exist.



Proof. Eq. (2) can be obtained by considering subgradient
optimality conditions. These can be written as
$0 \in \{-X^\top(y - Xw^\star) + \lambda p : p \in \partial\|w^\star\|_1\}$,
where $\partial\|w^\star\|_1$ denotes the subdifferential of the
$\ell_1$-norm at $w^\star$. A classical result (Borwein & Lewis,
2006) says that the subgradients $p$ are the vectors
in $\mathbb{R}^p$ such that for all $j$ in $\{1, \ldots, p\}$,
$p_j = \mathrm{sign}(w_j^\star)$ if $w_j^\star \neq 0$, and
$|p_j| \leq 1$ otherwise. This gives Eq. (2).
The equalities in Eq. (2) define a linear system that has
a unique solution given by (3) when $X_J$ is full rank.

Let us now show the uniqueness of the Lasso solution.
Consider another solution $w'^\star$ and choose a scalar
$\theta$ in $(0, 1)$. By convexity,
$w^{\theta\star} \triangleq \theta w^\star + (1-\theta)w'^\star$ is also
a solution. For all $j \notin J$, we have
$|x^{j\top}(y - Xw^{\theta\star})| \leq \theta|x^{j\top}(y - Xw^\star)| + (1-\theta)|x^{j\top}(y - Xw'^\star)| < \lambda$.
Combining this inequality with the conditions (2), we
necessarily have $w^{\theta\star}_{J^c} = w^\star_{J^c} = 0$,^2 and the
vector $w^{\theta\star}_J$ is also a solution of the following
reduced problem:

$$\min_{w \in \mathbb{R}^{|J|}} \frac{1}{2}\|y - X_J w\|_2^2 + \lambda\|w\|_1.$$

When $X_J$ is full rank, the Hessian $X_J^\top X_J$ is positive
definite and this reduced problem is strictly convex.
Thus, it admits a unique solution $w^{\theta\star}_J = w^\star_J$. It is
then easy to conclude that $w^\star = w^{\theta\star} = w'^\star$. □
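
The conditions of Eq. (2) are easy to check numerically. The following is a minimal sketch in Python with NumPy, not taken from the paper: the function name `satisfies_lasso_kkt` and the tolerance `tol` are illustrative assumptions. It uses an orthonormal design, for which the Lasso solution is known to be given by soft-thresholding.

```python
import numpy as np

def satisfies_lasso_kkt(X, y, w, lam, tol=1e-9):
    """Check the optimality conditions of Eq. (2) up to a tolerance (hypothetical helper)."""
    corr = X.T @ (y - X @ w)          # correlations x^j^T (y - Xw)
    active = w != 0
    # active coordinates: correlation equals lam * sign(w_j)
    ok_active = np.all(np.abs(corr[active] - lam * np.sign(w[active])) <= tol)
    # inactive coordinates: correlation bounded by lam in absolute value
    ok_inactive = np.all(np.abs(corr[~active]) <= lam + tol)
    return bool(ok_active and ok_inactive)

# Orthonormal design: the solution is the soft-thresholding of y.
X = np.eye(3)
y = np.array([3.0, -1.0, 0.5])
lam = 1.0
w_soft = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)   # [2, 0, 0]
print(satisfies_lasso_kkt(X, y, w_soft, lam))    # True
print(satisfies_lasso_kkt(X, y, y.copy(), lam))  # False: least squares is not optimal for lam = 1
```

In this example the active set of Lemma 1 is $J = \{1, 2\}$ (the second correlation has magnitude exactly $\lambda$ although the coefficient is zero), and $X_J$ is full rank, so the solution is unique.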

With the assumption that the matrix $X_J$ is always full
rank, we can formally recall a well-known property of
the Lasso (see Markowitz, 1952; Osborne et al., 2000;
Efron et al., 2004) in the following lemma:

Lemma 2 (Piecewise Linearity of the Path).

Assume that for any $\lambda > 0$ and solution of Eq. (1) the
matrix $X_J$ defined in Lemma 1 is full rank. Then, the
regularization path $\{w^\star(\lambda) : \lambda > 0\}$ is well defined,
unique and continuous piecewise linear.

Proof. The existence/uniqueness of the regularization
path was shown in Lemma 1.

Let us define $\{\eta^\star(\lambda) \triangleq \mathrm{sign}(w^\star(\lambda)) : \lambda > 0\}$
the set of sparsity patterns. Let us now consider
$\lambda_1 < \lambda_2$ such that $\eta^\star(\lambda_1) = \eta^\star(\lambda_2)$.
For all $\theta \in [0, 1]$, it is easy to see that the solution
$w^{\theta\star} \triangleq \theta w^\star(\lambda_1) + (1-\theta)w^\star(\lambda_2)$
satisfies the optimality conditions of Lemma 1 for
$\lambda = \theta\lambda_1 + (1-\theta)\lambda_2$, and that
$w^\star(\theta\lambda_1 + (1-\theta)\lambda_2) = w^{\theta\star}$.

This shows that whenever two solutions $w^\star(\lambda_1)$
and $w^\star(\lambda_2)$ have the same signs for $\lambda_1 \neq \lambda_2$,
the regularization path between $\lambda_1$ and $\lambda_2$ is a
linear segment. As an important consequence, the number
of linear segments of the path is smaller than $3^p$, the
number of possible sparsity patterns in $\{-1, 0, 1\}^p$.
The path $\mathcal{P}$ is therefore piecewise linear with a
finite number of kinks.

^2 $J^c$ denotes the complement of the set $J$ in $\{1, \ldots, p\}$.

Moreover, since the function $\lambda \mapsto w^\star(\lambda)$ is piecewise
linear, it is piecewise continuous and has right and
left limits for every $\lambda > 0$. It is easy to show that
these limits satisfy the optimality conditions of Eq. (2).
By uniqueness of the Lasso solution, they are equal
to $w^\star(\lambda)$ and the function is in fact continuous. □

Assuming again that $X_J$ is always full rank, we
can now present in Algorithm 1 the homotopy
method (Osborne et al., 2000; Efron et al., 2004).

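Algorithm 1 Homotopy Algorithm for the Lasso.
 1: Inputs: a vector $y$ in $\mathbb{R}^n$; a matrix $X$ in $\mathbb{R}^{n \times p}$;
 2: initialization: set $\lambda$ to $\|X^\top y\|_\infty$; we have
    $w^\star(\lambda) = 0$ (trivial solution);
 3: set $J \triangleq \{j_0\}$ such that $|x^{j_0\top}y| = \lambda$;
 4: while $\lambda > 0$ do
 5:   set $\eta \triangleq \mathrm{sign}(X^\top(y - Xw^\star(\lambda)))$;
 6:   compute the direction of the path:
        $w_J^\star(\lambda) = (X_J^\top X_J)^{-1}(X_J^\top y - \lambda\eta_J)$,
        $w_{J^c}^\star(\lambda) = 0$;
 7:   find the smallest step $\tau > 0$ such that:
        • there exists $j \in J^c$ such that
          $|x^{j\top}(y - Xw^\star(\lambda-\tau))| = \lambda-\tau$; add $j$ to $J$;
        • there exists $j$ in $J$ such that $w_j^\star(\lambda) \neq 0$ and
          $w_j^\star(\lambda-\tau) = 0$; remove $j$ from $J$;
 8:   replace $\lambda$ by $\lambda-\tau$; record the pair $(\lambda, w^\star(\lambda))$;
 9: end while
10: Return: sequence of recorded values $(\lambda, w^\star(\lambda))$.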



It can be shown that this algorithm maintains the
optimality conditions of Lemma 1 when $\lambda$ decreases. Two
assumptions nevertheless have to be made for the
algorithm to be correct. First, $X_J^\top X_J$ has to be
invertible, which is a reasonable assumption commonly
made when working with real data and when one is
interested in sparse solutions. When $X_J^\top X_J$ becomes
ill-conditioned, which may typically occur for small
values of $\lambda$, the algorithm has to stop and the path is
truncated. Second, one assumes in Step 7 of the
algorithm that the value $\tau$ corresponds to a single event:
either $|x^{j\top}(y - Xw^\star(\lambda-\tau))| = \lambda-\tau$ for a single
$j$ in $J^c$, or $w_j^\star(\lambda-\tau)$ hits zero for a single $j$ in $J$.
In other words, variables enter or exit the path one at a
time. Even though this assumption is reasonable most of
the time, it can be problematic from a numerical point of
view in rare cases. When the length of a linear segment
of $\mathcal{P}$ is smaller than the numerical precision, the
algorithm can fail. In contrast, our approximate homotopy
algorithm presented in Section 4 is robust to this issue.
In the next section, we present our worst-case complexity
analysis of the regularization path, showing that
Algorithm 1 can have exponential complexity.



3. Worst-Case Complexity 

We denote by $\{\eta^\star(\lambda) \triangleq \mathrm{sign}(w^\star(\lambda)) : \lambda > 0\}$ the set
of sparsity patterns in $\{-1, 0, 1\}^p$ encountered along
the path $\mathcal{P}$. We have seen in the proof of Lemma 2
that whenever $\eta^\star(\lambda_1) = \eta^\star(\lambda_2)$ for
$\lambda_1, \lambda_2 > 0$, then $\eta^\star(\lambda) = \eta^\star(\lambda_1)$ for all
$\lambda \in [\lambda_1, \lambda_2]$, and thus the number of linear
segments of $\mathcal{P}$ is upper-bounded by $3^p$.
With an additional argument, we can further reduce
this number, as stated in the following proposition:

Proposition 1 (Upper-bound Complexity).

Let us assume the same conditions as in Lemma 2. The
number of linear segments in the regularization path of
the Lasso is at most $(3^p + 1)/2$.

Proof. We have already noticed that the number of linear
segments of the path is at most $3^p$. Let us consider
$\eta^\star(\lambda_1) \neq 0$ for $\lambda_1 > 0$. We now show that for all
$\lambda_2 > 0$, we have $\eta^\star(\lambda_2) \neq -\eta^\star(\lambda_1)$, and
therefore the number of different sparsity patterns on the
path $\mathcal{P}$ is in fact less than or equal to $(3^p + 1)/2$.

Let us assume that there exists $\lambda_2 > 0$ with
$\eta^\star(\lambda_2) = -\eta^\star(\lambda_1)$, and look for a contradiction.
We define the set $J' \triangleq \{j \in \{1, \ldots, p\} : \eta_j^\star(\lambda_1) \neq 0\}$,
and consider the solution of the reduced problem for all $\lambda \geq 0$:

$$\tilde{w}^\star(\lambda) \triangleq \arg\min_{\tilde{w} \in \mathbb{R}^{|J'|}} \frac{1}{2}\|y - X_{J'}\tilde{w}\|_2^2 + \lambda\|\tilde{w}\|_1,$$

which is well defined since the optimization problem
is strictly convex (the conditions of Lemma 2 imply
that $X_{J'}$ is full rank). We remark that
$\tilde{w}^\star(\lambda_1) = w^\star_{J'}(\lambda_1)$ and
$\tilde{w}^\star(\lambda_2) = w^\star_{J'}(\lambda_2)$. Given the optimality
conditions of Lemma 1, it is then easy to show that

$$\tilde{w}^\star(0) = (X_{J'}^\top X_{J'})^{-1}X_{J'}^\top y = \frac{\lambda_2}{\lambda_1+\lambda_2}\tilde{w}^\star(\lambda_1) + \frac{\lambda_1}{\lambda_1+\lambda_2}\tilde{w}^\star(\lambda_2).$$

Since the signs of $\tilde{w}^\star(\lambda_1)$ and $\tilde{w}^\star(\lambda_2)$
are opposite to each other and non-zero, we have
$\|\tilde{w}^\star(0)\|_1 < \|\tilde{w}^\star(\lambda_1)\|_1$. Independently, it is
also easy to show that the function
$\lambda \mapsto \|\tilde{w}^\star(\lambda)\|_1$ should be
non-increasing, and we obtain a contradiction. □

In the next proposition, we present our adversarial
strategy to build a pathological regularization path.
Given a Lasso problem with $p$ variables and a path $\mathcal{P}$,
we design an additional variable along with an extra
dimension, such that the number of kinks of the new
path $\tilde{\mathcal{P}}$ increases by a multiplicative factor compared
to $\mathcal{P}$. We call our strategy "adversarial" since it
consists of iteratively designing "pathological" variables.

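Proposition 2 (Adversarial Strategy).

Let us consider $y$ in $\mathbb{R}^n$ and $X$ in $\mathbb{R}^{n \times p}$ such that the
conditions of Lemma 2 are satisfied and $y$ is in the
span of $X$. We denote by $\mathcal{P}$ the regularization path
of the Lasso problem corresponding to $(y, X)$, by $k$ the
number of linear segments of $\mathcal{P}$, and by $\lambda_1 > 0$ the
smallest value of the parameter $\lambda$ corresponding to a
kink of $\mathcal{P}$. We define the vector $\tilde{y}$ in $\mathbb{R}^{n+1}$ and the
matrix $\tilde{X}$ in $\mathbb{R}^{(n+1) \times (p+1)}$ as follows:

$$\tilde{y} \triangleq \begin{bmatrix} y \\ y_{n+1} \end{bmatrix}, \qquad \tilde{X} \triangleq \begin{bmatrix} X & 2\alpha y \\ 0 & \alpha y_{n+1} \end{bmatrix},$$

where $y_{n+1} \neq 0$ and $0 < \alpha < \lambda_1/(2y^\top y + y_{n+1}^2)$.

Then, the regularization path $\tilde{\mathcal{P}}$ of the Lasso problem
associated to $(\tilde{y}, \tilde{X})$ exists and has $3k-1$ linear
segments. Moreover, let us consider $\{\eta^1 = 0, \eta^2, \ldots, \eta^k\}$
the sequence of sparsity patterns in $\{-1, 0, 1\}^p$ of $\mathcal{P}$
(the signs of the solutions $w^\star(\lambda)$), ordered from large
to small values of $\lambda$. The sequence of sparsity patterns
in $\{-1, 0, 1\}^{p+1}$ of the new path $\tilde{\mathcal{P}}$ is the following:

$$\underbrace{\begin{bmatrix} \eta^1 = 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \eta^2 \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} \eta^k \\ 0 \end{bmatrix}}_{\text{first } k \text{ patterns}},\;
\underbrace{\begin{bmatrix} \eta^k \\ 1 \end{bmatrix}, \begin{bmatrix} \eta^{k-1} \\ 1 \end{bmatrix}, \ldots, \begin{bmatrix} \eta^1 = 0 \\ 1 \end{bmatrix}}_{\text{middle } k \text{ patterns}},\;
\underbrace{\begin{bmatrix} -\eta^2 \\ 1 \end{bmatrix}, \begin{bmatrix} -\eta^3 \\ 1 \end{bmatrix}, \ldots, \begin{bmatrix} -\eta^k \\ 1 \end{bmatrix}}_{\text{last } k-1 \text{ patterns}}. \qquad (4)$$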

Let us first make some remarks about this proposition:

• According to Eq. (4), the sparsity patterns of the
new path $\tilde{\mathcal{P}}$ are related to those of $\mathcal{P}$. More precisely,
they have either the form $[\eta^{i\top}, 0]^\top$ or $[\pm\eta^{i\top}, 1]^\top$,
where $\eta^i$ is a sparsity pattern in $\{-1, 0, 1\}^p$ of $\mathcal{P}$.

• The last column of $\tilde{X}$ involves a factor $\alpha$ that
controls its norm. With $\alpha$ small enough, the $(p+1)$-th
variable enters the path $\tilde{\mathcal{P}}$ late. As shown in Eq. (4),
the first $k$ sparsity patterns of $\tilde{\mathcal{P}}$ do not involve this
variable and are exactly the same as those of $\mathcal{P}$.

• Let us give some intuition about the pathological
behavior of the path $\tilde{\mathcal{P}}$. The first $k$ kinks of $\tilde{\mathcal{P}}$ are the
same as those of $\mathcal{P}$, and after these first $k$ kinks we
have $y \approx Xw^\star(\lambda)$. Then, the $(p+1)$-th variable enters
the path and we heuristically have

$$\tilde{y} \approx \begin{bmatrix} Xw^\star(\lambda) \\ y_{n+1} \end{bmatrix} \quad \text{and} \quad \tilde{y} \approx \tilde{X}\begin{bmatrix} -w^\star(\lambda) \\ 1/\alpha \end{bmatrix}. \qquad (5)$$

The left side of Eq. (5) tells us that when the $(p+1)$-th
variable is inactive, the coefficients associated to
the first $p$ variables should be close to $w^\star(\lambda)$. At the
same time, the right side of Eq. (5) tells us that when
the $(p+1)$-th variable is active, these same $p$ coefficients
should be instead close to $-w^\star(\lambda)$. According to
Eq. (4), the signs of these $p$ coefficients along the path
switch from $\eta^k = \mathrm{sign}(w^\star(\lambda))$ to $-\eta^k$ by following the
sequence $\eta^k, \eta^{k-1}, \ldots, (\eta^1 = 0 = -\eta^1), -\eta^2, \ldots, -\eta^k$,
resulting in a path with $3k-1$ linear segments. The
proof below more rigorously describes this strategy:



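Proof. Existence of the new regularization path:

Let us rewrite the Lasso problem for $(\tilde{y}, \tilde{X})$, writing
$\tilde{\mathbf{w}}$ for the first $p$ coefficients (a vector in $\mathbb{R}^p$) and
$\tilde{w}$ for the last one (a scalar):

$$\min_{\tilde{\mathbf{w}} \in \mathbb{R}^p,\, \tilde{w} \in \mathbb{R}} \frac{1}{2}\left\|\tilde{y} - \tilde{X}\begin{bmatrix} \tilde{\mathbf{w}} \\ \tilde{w} \end{bmatrix}\right\|_2^2 + \lambda\left\|\begin{bmatrix} \tilde{\mathbf{w}} \\ \tilde{w} \end{bmatrix}\right\|_1$$
$$= \min_{\tilde{\mathbf{w}} \in \mathbb{R}^p,\, \tilde{w} \in \mathbb{R}} \frac{1}{2}\|(1-2\alpha\tilde{w})y - X\tilde{\mathbf{w}}\|_2^2 + \frac{1}{2}(y_{n+1} - \alpha y_{n+1}\tilde{w})^2 + \lambda\|\tilde{\mathbf{w}}\|_1 + \lambda|\tilde{w}|. \qquad (6)$$

Let $(\tilde{\mathbf{w}}^\star, \tilde{w}^\star)$ be a solution for a given $\lambda > 0$. By
fixing $\tilde{w} = \tilde{w}^\star$ in Eq. (6) and optimizing with respect
to $\tilde{\mathbf{w}}$, we obtain an equivalent problem to (6):

$$\min_{w' \in \mathbb{R}^p} \frac{1}{2}\|y - Xw'\|_2^2 + \frac{\lambda}{|1-2\alpha\tilde{w}^\star|}\|w'\|_1,$$

with the change of variable $\tilde{\mathbf{w}} = (1-2\alpha\tilde{w}^\star)w'$ and
assuming $1-2\alpha\tilde{w}^\star \neq 0$. The solution of this problem
is unique since it is a point of $\mathcal{P}$ and we therefore have

$$\tilde{\mathbf{w}}^\star = \begin{cases} (1-2\alpha\tilde{w}^\star)\, w^\star\!\left(\dfrac{\lambda}{|1-2\alpha\tilde{w}^\star|}\right) & \text{if } \tilde{w}^\star \neq \dfrac{1}{2\alpha}, \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

Since the last column of $\tilde{X}$ is not in the span of the
first $p$ columns by construction of $\tilde{X}$, it is then easy
to see that the conditions of Lemma 2 are necessarily
satisfied and therefore $(\tilde{\mathbf{w}}^\star, \tilde{w}^\star)$ is in fact the unique
solution of Eq. (6). Since this is true for all $\lambda > 0$,
the regularization path is well defined, and we denote
from now on the above solutions by $\tilde{\mathbf{w}}^\star(\lambda)$ and $\tilde{w}^\star(\lambda)$.

Maximum number of linear segments:

We now show that the number of linear segments of
the path is upper-bounded by $3k-1$. Eq. (7) shows that
$\mathrm{sign}(\tilde{\mathbf{w}}^\star(\lambda))$ has the form $\pm\eta^i$, where $\eta^i$ in
$\{-1, 0, 1\}^p$ is one of the $k$ sparsity patterns from $\mathcal{P}$,
whereas we have three possibilities for $\mathrm{sign}(\tilde{w}^\star(\lambda))$,
namely $\{-1, 0, +1\}$. Since one cannot have two non-zero
sparsity patterns that are opposite to each other on the
same path, as shown in the proof of Proposition 1, the
number of possible sparsity patterns reduces to $3k-1$.

Characterization of the first k linear segments:

Let us consider $\lambda \geq \lambda_1$ and show that $\tilde{\mathbf{w}}^\star(\lambda) = w^\star(\lambda)$
and $\tilde{w}^\star(\lambda) = 0$ by checking the optimality conditions of
Lemma 1. The first $p$ equalities/inequalities in Eq. (2)
are easy to verify, the last one being also satisfied:

$$|2\alpha y^\top(y - Xw^\star(\lambda)) + \alpha y_{n+1}^2| \leq 2\alpha\|y\|_2^2 + \alpha y_{n+1}^2 < \lambda_1,$$

where the last inequality is obtained from the definition
of $\alpha$. Since this inequality is strict, this also
ensures that there exists $0 < \lambda_1' < \lambda_1$ such that
$\tilde{\mathbf{w}}^\star(\lambda) = w^\star(\lambda)$ and $\tilde{w}^\star(\lambda) = 0$ for all
$\lambda \geq \lambda_1'$. We have therefore shown that the first $k$
sparsity patterns of the regularization path are given in Eq. (4).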
Characterization of the last 2k−1 segments:

We mainly use here the form of Eq. (7) and a few
continuity arguments to characterize the rest of the
path. First, we remark that for all $\beta$ in $[0, 1/\alpha)$, there
exists a value for $\lambda > 0$ such that $\tilde{w}^\star(\lambda) = \beta$. This is
true because: (i) $\lambda \mapsto \tilde{w}^\star(\lambda)$ is continuous;
(ii) $\tilde{w}^\star(\lambda_1') = 0$; (iii) $\tilde{w}^\star(0^+) = 1/\alpha$. Point (i) was
shown in Lemma 2, point (ii) in the previous paragraph,
and point (iii) is necessary to have the term
$(y_{n+1} - \alpha y_{n+1}\tilde{w})^2$ in Eq. (6) go to $0$ when $\lambda$ goes to $0^+$.

We now consider two values $\lambda_1', \lambda_2' > 0$ such that
$\tilde{w}^\star(\lambda_1') = 0$, $\tilde{w}^\star(\lambda_2') = \frac{1}{2\alpha}$ and
$\tilde{w}^\star(\lambda) \in (0, \frac{1}{2\alpha})$ for all $\lambda \in (\lambda_2', \lambda_1')$.
On this open interval, we have that
$(1-2\alpha\tilde{w}^\star(\lambda)) > 0$, and the continuous function
$\lambda \mapsto \lambda/|1-2\alpha\tilde{w}^\star(\lambda)|$ ranges from $\lambda_1'$ to $+\infty$.
Combining this observation with Eq. (7), we obtain that all
sparsity patterns of the form $[\eta^{i\top}, 1]^\top$ for $i$ in
$\{1, \ldots, k\}$ appear on the regularization path. With similar
continuity arguments, it is easy to show that all sparsity
patterns of the form $[-\eta^{i\top}, 1]^\top$ for $i$ in $\{1, \ldots, k\}$
appear on the path as well.

We had previously identified $k$ of the sparsity patterns,
and now have identified $2k-1$ different ones. Since we
have at most $3k-1$ linear segments, the set of sparsity
patterns on the path $\tilde{\mathcal{P}}$ is entirely characterized. The
fact that the sequence of sparsity patterns is the one
given in Eq. (4) can easily be shown by reusing similar
continuity arguments. □

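With this proposition in hand, we can now state the
main result of this section:

Theorem 1 (Worst-case Complexity).

In the worst case, the regularization path of the Lasso
has exactly $(3^p + 1)/2$ linear segments.

Proof. We start with $n = p = 1$, and define $y = [1]$
and $X = [1]$, leading to a path with $k = 2$ segments.
We then recursively apply Proposition 2, keeping $n = p$,
choosing at iteration $p+1$, $y_{p+1} = 1$, and a
factor $\alpha = \alpha_{p+1}$ satisfying the conditions of Proposition 2.
Denoting by $k_p$ the number of linear segments at
iteration $p$, we have that $k_{p+1} = 3k_p - 1$, and it is easy to
show that $k_p = (3^p + 1)/2$. According to Proposition 1,
this is the longest possible regularization path. Note
that this example has a particularly simple shape, which
follows from unrolling the recursion of Proposition 2
with $\alpha_1 = 1$:

$$y = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \qquad X = \begin{bmatrix} \alpha_1 & 2\alpha_2 & 2\alpha_3 & \cdots & 2\alpha_p \\ 0 & \alpha_2 & 2\alpha_3 & \cdots & 2\alpha_p \\ 0 & 0 & \alpha_3 & \cdots & 2\alpha_p \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \alpha_p \end{bmatrix}. \qquad □$$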
Figure 1. Pathological regularization path with $p = 6$
variables and $(3^p + 1)/2 = 365$ kinks. The curves represent the
values of the coefficients at every kink of the path. For
visibility purposes, we use a non-linear scale and report the
values $\mathrm{sign}(w)|w|^{0.1}$ for a coefficient $w$. Best seen in color.
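
The closed form for the number of segments used in the proof of Theorem 1 follows from a one-line induction on the recursion $k_{p+1} = 3k_p - 1$ with $k_1 = 2$. It is spelled out here as a worked step, and it also confirms the count of 365 reported for $p = 6$:

```latex
k_1 = 2 = \frac{3^1 + 1}{2}, \qquad
k_p = \frac{3^p + 1}{2}
\;\Longrightarrow\;
k_{p+1} = 3k_p - 1 = \frac{3^{p+1} + 3}{2} - 1 = \frac{3^{p+1} + 1}{2}.
% Example: p = 6 gives (729 + 1)/2 = 365.
```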

3.1. Numerical Simulations 

We have implemented Algorithm 1 in Matlab, optimizing
numerical precision regardless of computational
efficiency, which has allowed us to check our theoretical
results for small values of $p$. For instance, we obtain
a path with $(3^p + 1)/2 = 88\,574$ linear segments
for $p = 11$, and present such a pathological path in
Figure 1. Note that when $p$ gets larger, these examples
quickly lead to precision issues where some kinks
are very close to each other. Our implementation and
our pathological examples will be made publicly available.
In the next section, we present more optimistic
results on approximate regularization paths.
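
To make the construction concrete, here is a minimal Python/NumPy sketch of Algorithm 1 together with the extension of Proposition 2. It is not the Matlab implementation mentioned above: the function names (`lasso_homotopy`, `count_segments`, `adversarial_extension`), the relative tolerance `rtol`, and the choice of $\alpha$ as half of its admissible upper bound are all assumptions made for illustration. The sketch relies on the two assumptions stated after Algorithm 1 (invertible $X_J^\top X_J$, one event per kink).

```python
import numpy as np

def lasso_homotopy(X, y, rtol=1e-10):
    """Kinks [(lam, w)] of the Lasso path, from lam_max down to lam = 0 (sketch)."""
    p = X.shape[1]
    corr = X.T @ y
    lam = float(np.max(np.abs(corr)))
    J = [int(np.argmax(np.abs(corr)))]
    w = np.zeros(p)
    path = [(lam, w.copy())]
    while lam > 0:
        XJ = X[:, J]
        eta = np.sign(XJ.T @ (y - X @ w))
        G = XJ.T @ XJ
        A = np.linalg.solve(G, XJ.T @ y)   # on this segment: w_J(l) = A - l * B
        B = np.linalg.solve(G, eta)
        a = X.T @ (y - XJ @ A)             # correlations on this segment: a + l * b
        b = X.T @ (XJ @ B)
        events = []
        with np.errstate(divide="ignore", invalid="ignore"):
            for pos, j in enumerate(J):    # an active coefficient hits zero
                events.append((A[pos] / B[pos], "leave", j))
            for j in set(range(p)) - set(J):  # a correlation reaches +l or -l
                events.append((a[j] / (1 - b[j]), "enter", j))
                events.append((-a[j] / (1 + b[j]), "enter", j))
        # keep events strictly below the current lam (the tolerance skips the kink we are on)
        events = [e for e in events if 0 < e[0] < lam * (1 - rtol)]
        lam, kind, j = max(events) if events else (0.0, "end", -1)
        w = np.zeros(p)
        w[J] = A - lam * B
        if kind == "enter":
            J.append(j)
        elif kind == "leave":
            J.remove(j)
            w[j] = 0.0
        path.append((float(lam), w.copy()))
    return path

def count_segments(X, y):
    """Number of linear segments for lam > 0, including the piece where w = 0."""
    return sum(1 for lam, _ in lasso_homotopy(X, y) if lam > 0) + 1

def adversarial_extension(X, y, y_new=1.0):
    """Build (X~, y~) as in Proposition 2; alpha is half of its upper bound (assumption)."""
    lam1 = min(lam for lam, _ in lasso_homotopy(X, y) if lam > 0)
    alpha = 0.5 * lam1 / (2 * (y @ y) + y_new ** 2)
    n, p = X.shape
    X_ext = np.zeros((n + 1, p + 1))
    X_ext[:n, :p] = X
    X_ext[:n, p] = 2 * alpha * y
    X_ext[n, p] = alpha * y_new
    return X_ext, np.append(y, y_new)

X, y = np.array([[1.0]]), np.array([1.0])
print(count_segments(X, y))            # 2
X2, y2 = adversarial_extension(X, y)
print(count_segments(X2, y2))          # 5 = 3 * 2 - 1
```

Applying `adversarial_extension` repeatedly should, in exact arithmetic, follow the recursion $k_{p+1} = 3k_p - 1$; in floating point the kinks get close to each other quickly, so this sketch is only expected to be reliable for very small $p$.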

4. Approximate Homotopy 

We now present another complexity analysis when ex- 
act solutions of Eq. (1) are not required. We follow in 
part the methodology of Giesen et al. (2010), later re- 
fined by Jaggi (2011), on approximate regularization 
paths of parameterized convex functions. Their re- 
sults are quite general but, as we show later, we obtain 
stronger results with an analysis tailored to the Lasso. 

A natural tool to guarantee the quality of approximate
solutions is the duality gap. Writing the Lagrangian of
problem (1) and minimizing with respect to the primal
variable $w$ yields the following dual formulation of (1):

$$\max_{\kappa \in \mathbb{R}^n} -\frac{1}{2}\kappa^\top\kappa - \kappa^\top y \quad \text{s.t.} \quad \|X^\top\kappa\|_\infty \leq \lambda, \qquad (8)$$

where $\kappa$ in $\mathbb{R}^n$ is a dual variable. Let us denote
by $f_\lambda(w)$ the objective function of the primal
problem (1) and by $g_\lambda(\kappa)$ the objective function of the
dual (8). Given a pair of feasible primal and dual
variables $(w, \kappa)$, the difference
$\delta_\lambda(w, \kappa) \triangleq f_\lambda(w) - g_\lambda(\kappa)$
is called a duality gap and provides an optimality
guarantee (see Borwein & Lewis, 2006):

$$0 \leq f_\lambda(w) - f_\lambda(w^\star(\lambda)) \leq \delta_\lambda(w, \kappa).$$

In plain words, it upper bounds the difference between
the current value of the objective function $f_\lambda(w)$ and
the optimal value of the objective function $f_\lambda(w^\star(\lambda))$.
In this paper, we use a relative duality gap criterion
to guarantee the quality of an approximate solution:^3
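Definition 1 ($\epsilon$-approximate Solution).
Let $\epsilon$ be in $[0, 1]$. A vector $w$ in $\mathbb{R}^p$ is said to be an
$\epsilon$-approximate solution of problem (1) if there exists $\kappa$
in $\mathbb{R}^n$ such that $\|X^\top\kappa\|_\infty \leq \lambda$ and
$\delta_\lambda(w, \kappa) \leq \epsilon f_\lambda(w)$.

Given a set $\mathcal{P} \triangleq \{w(\lambda) \in \mathbb{R}^p : \lambda > 0\}$, we say that
$\mathcal{P}$ is an $\epsilon$-approximate regularization path if any point
$w(\lambda)$ of $\mathcal{P}$ is an $\epsilon$-approximate solution for problem (1).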

Our goal is now to build $\epsilon$-approximate regularization
paths and study their complexity. To that effect, we
introduce approximate optimality conditions based on
small perturbations of those given in Lemma 1:

Definition 2 ($OPT_\lambda(\epsilon_1, \epsilon_2)$ Condition).

Let $\epsilon_1 \geq 0$ and $\epsilon_2 \geq -\epsilon_1$. A vector $w$ in $\mathbb{R}^p$ satisfies the
$OPT_\lambda(\epsilon_1, \epsilon_2)$ condition if and only if for all $1 \leq j \leq p$,

$$\lambda(1-\epsilon_2) \leq x^{j\top}(y - Xw)\,\mathrm{sign}(w_j) \leq \lambda(1+\epsilon_1) \quad \text{if } w_j \neq 0,$$
$$|x^{j\top}(y - Xw)| \leq \lambda(1+\epsilon_1) \quad \text{otherwise.} \qquad (9)$$

Note that when $\epsilon_1 = \epsilon_2 = 0$, this condition reduces to
the exact optimality conditions of Lemma 1. Of interest
for us is the relation between Definitions 1 and 2.
Let us consider a vector $w$ such that $OPT_\lambda(\epsilon_1, \epsilon_2)$ is
satisfied. Then, the vector $\kappa \triangleq \frac{1}{1+\epsilon_1}(Xw - y)$ is
feasible for the dual (8) and we can compute a duality gap:

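$$\delta_\lambda(w, \kappa) = f_\lambda(w) - g_\lambda(\kappa)$$
$$= \frac{1}{2}(1+\epsilon_1)^2\kappa^\top\kappa + \lambda\|w\|_1 + \frac{1}{2}\kappa^\top\kappa + \kappa^\top y$$
$$= \frac{\epsilon_1^2}{2}\kappa^\top\kappa + \lambda\|w\|_1 + \kappa^\top(y + (1+\epsilon_1)\kappa)$$
$$= \frac{\epsilon_1^2}{(1+\epsilon_1)^2}\,\frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1 + \kappa^\top Xw.$$

From Eq. (9), it is easy to show that
$\lambda\|w\|_1 + \kappa^\top Xw \leq \frac{\epsilon_1+\epsilon_2}{1+\epsilon_1}\lambda\|w\|_1$,
and we can obtain the following bound:

$$\delta_\lambda(w, \kappa) \leq \max\left(\frac{\epsilon_1^2}{(1+\epsilon_1)^2}, \frac{\epsilon_1+\epsilon_2}{1+\epsilon_1}\right) f_\lambda(w). \qquad (10)$$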
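
The rescaling of the residual into a feasible dual variable can be sketched in a few lines of Python/NumPy. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and $\epsilon_1$ is chosen here as the smallest non-negative value making $\kappa = (Xw - y)/(1+\epsilon_1)$ feasible for (8), which is an assumption of the example.

```python
import numpy as np

def primal_objective(X, y, w, lam):
    r = y - X @ w
    return 0.5 * (r @ r) + lam * np.abs(w).sum()

def dual_objective(y, kappa):
    return -0.5 * (kappa @ kappa) - kappa @ y

def relative_duality_gap(X, y, w, lam):
    """Return (delta / f, eps1) for the rescaled dual variable (hypothetical helper)."""
    r = y - X @ w
    # smallest eps1 >= 0 such that ||X^T r||_inf <= lam * (1 + eps1)
    eps1 = max(0.0, float(np.max(np.abs(X.T @ r))) / lam - 1.0)
    kappa = -r / (1.0 + eps1)          # feasible for the dual (8) by construction
    f = primal_objective(X, y, w, lam)
    return (f - dual_objective(y, kappa)) / f, eps1

X = np.eye(2)
y = np.array([3.0, 1.0])
print(relative_duality_gap(X, y, np.array([2.0, 0.0]), 1.0))  # exact solution: gap 0, eps1 0
print(relative_duality_gap(X, y, np.zeros(2), 1.0))           # w = 0: gap 4/9, eps1 2
```

For $w = 0$ the term $\lambda\|w\|_1 + \kappa^\top Xw$ vanishes, so the relative gap equals $\epsilon_1^2/(1+\epsilon_1)^2 = 4/9$, in agreement with the last line of the derivation above.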



From this upper bound, we derive our first result: 



^3 Note that our criterion is not exactly the same as
in Jaggi (2011). Whereas Jaggi (2011) considers a formulation
where the $\ell_1$-norm appears in a constraint, Eq. (1)
involves an $\ell_1$-penalty. Even though these formulations have
the same regularization path, they involve slightly different
objective functions, dual formulations, and duality gaps.



Proposition 3 (Approximate Analysis).

Let $y$ be in $\mathbb{R}^n$ and $X$ in $\mathbb{R}^{n \times p}$ such that the
conditions of Lemma 2 are satisfied. Let
$\lambda_\infty \triangleq \|X^\top y\|_\infty$ be the value of $\lambda$
corresponding to the start of the path, and $\lambda_1 > 0$ be the
one corresponding to the last kink. For all $\epsilon \in (0, 1)$,
there exists an $\epsilon$-approximate regularization path with at most
$\left\lceil \frac{\log(\lambda_\infty/\lambda_1)}{\sqrt{\epsilon}} \right\rceil$
linear segments.



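Proof. From Eq. (9), one can show by a simple calculation
that an exact solution $w^\star(\lambda)$ for a given $\lambda$
satisfies $OPT_{\lambda(1-\epsilon_3)}(\epsilon_3/(1-\epsilon_3), -\epsilon_3/(1-\epsilon_3))$
for any $\epsilon_3$ in $[0, 1)$. According to Eq. (10), there exists
a dual variable $\kappa$ such that
$\delta_{\lambda(1-\epsilon_3)}(w^\star(\lambda), \kappa) \leq \epsilon_3^2 f_{\lambda(1-\epsilon_3)}(w^\star(\lambda))$.
Thus, for any $\lambda'$ chosen in $[\lambda(1-\sqrt{\epsilon}), \lambda]$, the
solution $w^\star(\lambda)$ is an $\epsilon$-approximate solution for the
parameter $\lambda'$. Between $\lambda_\infty$ and $\lambda_1$, we can obtain
an $\epsilon$-approximate piecewise linear (in fact piecewise
constant) regularization path by sampling solutions
$w^\star(\lambda)$ for $\lambda$ in
$\{\lambda_\infty, \lambda_\infty(1-\sqrt{\epsilon}), \ldots, \lambda_\infty(1-\sqrt{\epsilon})^k, \lambda_1\}$
with $\lambda_\infty(1-\sqrt{\epsilon})^{k+1} \leq \lambda_1$. Since
$-\log(1-x) \geq x$ for $x$ in $[0, 1)$, the number of segments
of the corresponding approximate path is at most

$$\left\lceil \frac{-\log(\lambda_\infty/\lambda_1)}{\log(1-\sqrt{\epsilon})} \right\rceil \leq \left\lceil \frac{\log(\lambda_\infty/\lambda_1)}{\sqrt{\epsilon}} \right\rceil. \qquad □$$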
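
The sampling scheme of the proof and the two bounds can be evaluated with a few lines of standard-library Python. This is only an illustration: the function names and the numerical values of $\lambda_\infty$, $\lambda_1$ and $\epsilon$ are arbitrary assumptions, not values from the paper.

```python
import math

def geometric_lambda_grid(lam_inf, lam_1, eps):
    """Sampling points lam_inf * (1 - sqrt(eps))**i above lam_1, followed by lam_1."""
    ratio = 1.0 - math.sqrt(eps)
    grid, lam = [], lam_inf
    while lam > lam_1:
        grid.append(lam)
        lam *= ratio
    grid.append(lam_1)
    return grid

def segment_bounds(lam_inf, lam_1, eps):
    """The two upper bounds appearing at the end of the proof of Proposition 3."""
    log_ratio = math.log(lam_inf / lam_1)
    tight = math.ceil(-log_ratio / math.log(1.0 - math.sqrt(eps)))
    simple = math.ceil(log_ratio / math.sqrt(eps))
    return tight, simple

grid = geometric_lambda_grid(1.0, 1e-3, 0.01)
print(len(grid) - 1)                     # 66 sampled solutions above lam_1
print(segment_bounds(1.0, 1e-3, 0.01))   # (66, 70)
```

Dividing $\epsilon$ by 100 multiplies both bounds by roughly 10, which is the $O(1/\sqrt{\epsilon})$ behavior discussed next.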



Note that the term $\lambda_\infty/\lambda_1$ is possibly large, but it is
controlled by a logarithmic function and can be
considered as constant for finite precision machines. In
other words, the complexity of the approximate path
is upper-bounded by $O(1/\sqrt{\epsilon})$. In contrast, the
analysis of Giesen et al. (2010) and Jaggi (2011) gives us:

• an approximate path with $O(1/\epsilon)$ linear segments
can be obtained with a weaker approximation guarantee
than ours. Namely, a bound $\delta \leq \epsilon$ along the path,
where $\delta$ is a duality gap, whereas we use relative
duality gaps of the form $\delta \leq \epsilon f_\lambda(w)$.^4 Interestingly, this
bound is proven to be optimal in the context of
parameterized convex functions on the $\ell_1$-ball. Our results
show that such a bound can be improved for the Lasso.

• a methodology to obtain relative duality gaps
along the path, which can easily provide complexity
bounds for the full path of different problems, notably
support vector machines, but not for the Lasso.

Proposition 3 is optimistic, but not practical since it
requires sampling exact solutions of the path $\mathcal{P}$. We
introduce an approximate homotopy method in
Algorithm 2 which does not require computing exact
solutions and still enjoys a similar complexity. It exploits
the piecewise linearity of the path, but uses a first-order
method (Beck & Teboulle, 2009; Fu, 1998) when
the linear segments of the path are too short.

^4 When there exist $m, M > 0$ such that $m \leq f_\lambda \leq M$, the
relative duality gap guarantee is similar (up to a constant)
to the simple bound $\delta \leq \epsilon$. However, we have for the Lasso
that $f_\lambda(w^\star(\lambda)) \to 0$ when $\lambda$ goes to $0^+$, as long as $y$ is in
the span of $X$. Note that, as noticed in footnote 3, Jaggi
(2011) uses a slightly different duality gap than ours.



Complexity Analysis of the Lasso Regularization Path 



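Algorithm 2 Approximate Homotopy for the Lasso.
 1: Inputs: a vector $y$ in $\mathbb{R}^n$, a matrix $X$ in $\mathbb{R}^{n \times p}$,
    the required precision $\epsilon \in [0, 1]$; $\lambda_1 > 0$;
 2: initialization: set $\lambda$ to $\|X^\top y\|_\infty$; set $w(\lambda) = 0$;
    set $\theta = 1 + \epsilon/2 - \sqrt{\epsilon}/2$;
 3: set $J \triangleq \{j_0\}$ such that $|x^{j_0\top}y| = \lambda$;
 4: while $\lambda > \lambda_1$ do
 5:   if $(X_J^\top X_J)$ is not invertible then go to 12;
 6:   set $\eta \triangleq (1/\lambda)X^\top(y - Xw(\lambda))$;
 7:   compute the approximate direction of the path:
        $w_J(\lambda) = (X_J^\top X_J)^{-1}(X_J^\top y - \lambda\eta_J)$,
        $w_{J^c}(\lambda) = 0$;
 8:   find the smallest step $\tau > 0$ such that:
        • there exists $j$ in $J^c$ such that
          $|x^{j\top}(y - Xw(\lambda-\tau))| = (\lambda-\tau)(1 + \frac{\epsilon}{2})$; add $j$ to $J$;
        • there exists $j$ in $J$ such that $w_j(\lambda) \neq 0$ and
          $w_j(\lambda-\tau) = 0$; remove $j$ from $J$;
 9:   if $\tau > \lambda\theta\sqrt{\epsilon}$ then
10:     replace $\lambda$ by $\lambda-\tau$;
11:   else
12:     replace $\lambda$ by $\lambda(1-\theta\sqrt{\epsilon})$;
13:     use a first-order optimization method to find
        a solution $w(\lambda)$ satisfying $OPT_\lambda(\epsilon/2, \epsilon/2)$;
14:     set $J \triangleq \{j \in \{1, \ldots, p\} : w_j(\lambda) \neq 0\}$;
15:   end if
16:   record the pair $(\lambda, w(\lambda))$;
17: end while
18: Return: sequence of recorded values $(\lambda, w(\lambda))$.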



Note that when $\epsilon = 0$, Algorithm 2 reduces to
Algorithm 1. Our approach exploits the following ideas,
which we formally prove in the sequel. Assume
that $w(\lambda)$ satisfies $OPT_\lambda(\epsilon/2, \epsilon/2)$. Then,

• $w(\lambda)$ is an $\epsilon$-approximation for all $\lambda'$ in
$[\lambda(1-\theta\sqrt{\epsilon}), \lambda]$. This guarantees that one can always
make step sizes for $\lambda$ greater than or equal to $\lambda\theta\sqrt{\epsilon}$;

• the direction followed in Step 8 maintains
$OPT_\lambda(\epsilon/2, \epsilon/2)$, but when two kinks are too close to
each other, that is, $\tau < \lambda\theta\sqrt{\epsilon}$, we directly look for a
solution for the parameter $\lambda' = \lambda(1-\theta\sqrt{\epsilon})$ that
satisfies $OPT_{\lambda'}(\epsilon/2, \epsilon/2)$. Any first-order method can
be used for that purpose, e.g., a proximal gradient
method (Beck & Teboulle, 2009), using the current
value $w(\lambda)$ as a warm start.
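
The first-order fallback described in the second item can be sketched as a plain proximal gradient loop that stops as soon as the condition of Definition 2 holds. The code below is a hedged illustration in Python/NumPy: the names `satisfies_opt` and `ista_until_opt`, the fixed step size $1/\|X\|_2^2$ and the iteration cap are assumptions of this sketch, and the paper's own experiments use coordinate descent instead.

```python
import numpy as np

def satisfies_opt(X, y, w, lam, eps1, eps2):
    """Check the OPT_lam(eps1, eps2) condition of Eq. (9) (hypothetical helper)."""
    corr = X.T @ (y - X @ w)
    nz = w != 0
    signed = corr[nz] * np.sign(w[nz])
    ok_nz = np.all((signed >= lam * (1 - eps2)) & (signed <= lam * (1 + eps1)))
    ok_z = np.all(np.abs(corr[~nz]) <= lam * (1 + eps1))
    return bool(ok_nz and ok_z)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_until_opt(X, y, lam, eps, w0=None, max_iter=10_000):
    """Proximal gradient steps from a warm start until OPT_lam(eps/2, eps/2) holds."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1]) if w0 is None else w0.copy()
    for _ in range(max_iter):
        if satisfies_opt(X, y, w, lam, eps / 2, eps / 2):
            return w
        w = soft_threshold(w + step * (X.T @ (y - X @ w)), step * lam)
    raise RuntimeError("OPT condition not reached within max_iter iterations")

X = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([1.0, 1.0])
w = ista_until_opt(X, y, lam=0.5, eps=0.01)
print(w)   # close to the exact solution [0.125, 0.75]
```

Passing the previous point of the path as `w0` corresponds to the warm start mentioned above; by Eq. (10), the returned vector is an $\epsilon$-approximate solution for the given $\lambda$.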

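Note also that when $(X_J^\top X_J)$ is not invertible, the
method uses first-order steps. The next proposition
precisely describes the guarantees of our algorithm.

Proposition 4 (Analysis of Algorithm 2).

Let $y$ be in $\mathbb{R}^n$ and $X$ in $\mathbb{R}^{n \times p}$. For all $\lambda_1 > 0$ and
$\epsilon \in (0, 1)$, Algorithm 2 returns an $\epsilon$-approximate
regularization path on $[\lambda_1, \lambda_\infty]$. Moreover, it terminates in at
most $\left\lceil \frac{\log(\lambda_\infty/\lambda_1)}{\theta\sqrt{\epsilon}} \right\rceil$
iterations, where $\lambda_\infty \triangleq \|X^\top y\|_\infty$ and $\theta$ is the
scalar set in the initialization of Algorithm 2.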

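Proof. We first show that any solution on the path
is an $\epsilon$-approximate solution. First, it is easy to
check that $OPT_\lambda(\epsilon/2, \epsilon/2)$ is always satisfied at Step 6.
This is either a consequence of Step 13, or because
the direction $w_J(\lambda') = (X_J^\top X_J)^{-1}(X_J^\top y - \lambda'\eta_J)$
maintains $OPT_{\lambda'}(\epsilon/2, \epsilon/2)$ when $\lambda'$ varies between $\lambda$
and $\lambda-\tau$. From Eq. (10), we obtain that $w(\lambda)$ is an
$\epsilon$-approximate solution whenever $OPT_\lambda(\epsilon/2, \epsilon/2)$ is
satisfied. Thus, we only need to check that $w(\lambda)$ is also
an $\epsilon$-approximate solution for $\lambda'$ in
$[\lambda(1-\theta\sqrt{\epsilon}), \lambda]$: for $\epsilon_3 \geq 0$, it is easy to check
that $OPT_\lambda(\epsilon/2, \epsilon/2)$ implies
$OPT_{\lambda(1-\epsilon_3)}((\epsilon/2+\epsilon_3)/(1-\epsilon_3), (\epsilon/2-\epsilon_3)/(1-\epsilon_3))$.
Setting $\epsilon_3 = \theta\sqrt{\epsilon}$ and using Eq. (10), it is possible to
show that the desired condition is satisfied.

Since the step size for $\lambda$ is always greater than $\lambda\theta\sqrt{\epsilon}$,
the maximum number of iterations is upper-bounded by

$$\left\lceil \frac{-\log(\lambda_\infty/\lambda_1)}{\log(1-\theta\sqrt{\epsilon})} \right\rceil \leq \left\lceil \frac{\log(\lambda_\infty/\lambda_1)}{\theta\sqrt{\epsilon}} \right\rceil. \qquad □$$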
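
The step "using Eq. (10), it is possible to show that the desired condition is satisfied" can be made explicit. The following worked computation is a reconstruction from Eq. (10) and the shifted parameters given in the proof, not a quotation from the paper; $\epsilon_1'$ and $\epsilon_2'$ are names introduced here for those shifted parameters.

```latex
\epsilon_1' = \frac{\epsilon/2 + \epsilon_3}{1 - \epsilon_3}, \quad
\epsilon_2' = \frac{\epsilon/2 - \epsilon_3}{1 - \epsilon_3}
\;\Longrightarrow\;
\frac{\epsilon_1'}{1 + \epsilon_1'} = \frac{\epsilon/2 + \epsilon_3}{1 + \epsilon/2}, \quad
\frac{\epsilon_1' + \epsilon_2'}{1 + \epsilon_1'} = \frac{\epsilon}{1 + \epsilon/2} \leq \epsilon .
% The second term of the max in Eq. (10) is therefore always at most epsilon.
% The first term is at most epsilon as soon as
\left( \frac{\epsilon/2 + \epsilon_3}{1 + \epsilon/2} \right)^2 \leq \epsilon
\;\Longleftrightarrow\;
\epsilon_3 \leq \sqrt{\epsilon}\left( 1 + \frac{\epsilon}{2} - \frac{\sqrt{\epsilon}}{2} \right).
```

Under this reading, any step $\epsilon_3 = \theta\sqrt{\epsilon}$ with $\theta \leq 1 + \epsilon/2 - \sqrt{\epsilon}/2$ keeps $w(\lambda)$ an $\epsilon$-approximate solution, and the admissible $\theta$ tends to 1 as $\epsilon$ goes to 0.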



We remark that the scalar $\theta$ is very close to 1 and
therefore the complexity is similar to the one of
Proposition 3, with a logarithmic function controlling the
possibly large term $\lambda_\infty/\lambda_1$. This algorithm is
practical in different aspects: (i) it is almost as simple to
implement as the homotopy method; (ii) it is robust
to cases where two kinks are too close for the classical
homotopy method to work; (iii) it provides optimality
guarantees along the path; (iv) whenever possible, it
explicitly exploits the piecewise linearity of the path.
We next present experiments to verify our analysis.

4.1. Numerical Simulations 

We have implemented Algorithm 2 with a few
modifications to the code used in Section 3.1. The inner
solver is a coordinate descent algorithm (see Fu, 1998),
with a stopping criterion based on Definition 2.

We consider four datasets. The first one, dubbed SYNTH,
consists of a pure noise fitting scenario with no statistical
meaning. The entries of the corresponding vector $y$
and matrix $X$ are i.i.d. draws from a standard normal
distribution. The next dataset is called PATHOL and
is a pathological example obtained from the analysis of
Section 3. Finally, we consider two datasets based on
real data, respectively dubbed MADELON^5 and
PCMAC^6. For each dataset, we center and normalize the
columns of $X$ and the vector $y$, and choose the
parameter $\lambda_1$ corresponding to the last kink of the true path.

For all datasets, we compute the full regularization
path using Algorithm 1 and several $\epsilon$-approximate
regularization paths using Algorithm 2. Note that




^5 http://www.nipsfsc.ecs.soton.ac.uk/datasets/.
^6 http://featureselection.asu.edu/datasets.php.






Table 1. Complexity results of $\epsilon$-approximate regularization
paths for four datasets with $n$ observations and $p$ variables.
The number of linear segments is denoted by $k$.

                             SYNTH    PATHOL   MADELON    PCMAC
n                            1 100        11     2 000    1 943
p                            1 000        11       500    3 289
k, full path                 1 615    88 574       517    2 561
k, $\epsilon = 10^{-5}$      1 297     2 744       468    1 254
k, $\epsilon = 10^{-4}$        686     1 071       327      444
k, $\epsilon = 10^{-3}$        268       405       152      155
k, $\epsilon = 10^{-2}$         96       146        61       53
k, $\epsilon = 0.1$             34        51        22       18
k, $\epsilon = 0.25$            21        32        15       11
k, $\epsilon = 0.5$             14        20        10        7


the path of PCMAC was stopped around A w 10""' 
where the matrix $X_J^\top X_J$ became ill-conditioned and
the Lasso solution dense. As a simple sanity check, we
first experimentally verify the correctness of
Propositions 3 and 4, by sampling solutions on the
approximate path we obtain, computing duality gaps, and
checking that the solutions are indeed $\epsilon$-approximate.
We conclude that our experimental results match our
theoretical analysis. We present the different path
complexities in Table 1.

Interestingly, the complexity of the pathological
example significantly reduces when one is looking for an
approximate solution. For example, for $\epsilon = 10^{-3}$, the
complexity of the approximate path is less than 0.5%
of that of the full path. This significantly contrasts
with the pessimistic result obtained in Section 3. As
expected, the two examples based on real data exhibit
a path complexity of the same order as the problem
size, which also significantly reduces when $\epsilon$ increases.

5. Conclusion 

We have presented new results on the regularization 
path and thus on homotopy methods for the Lasso. 
First, we have shown that the path has an exponen- 
tial worst-case complexity, which, as far as we know, 
had never been formally proved before. Our second re- 
sult is more optimistic, and shows that when an exact 
path is not required, only a relatively small number of 
points on the path need to be computed. Finally, we 
propose a practical approximate homotopy algorithm, 
which can provide such approximate paths at a desired 
precision. 